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Abstract 

We propose a novel method to model nonlinear regression problems by adapting 
the principle of penalization to Partial Least Squares (PLS). Starting with a general- 
ized additive model, we expand the additive component of each variable in terms of 
a generous amount of B-Splines basis functions. In order to prevent overfitting and 
to obtain smooth functions, we estimate the regression model by applying a penaHzed 
version of PLS. Although our motivation for penalized PLS stems from its use for 
B-Splines transformed data, the proposed approach is very general and can be applied 
to other penalty terms or to other dimension reduction techniques. It turns out that 
penalized PLS can be computed virtually as fast as PLS. We prove a close connection 
of penaHzed PLS to the solutions of preconditioned Hnear systems. In the case of 
high-dimensional data, the new method is shown to be an attractive competitor to 
other techniques for estimating generalized additive models. If the number of pre- 
dictor variables is high compared to the number of examples, traditional techniques 
often suffer from overfitting. We illustrate that penalized PLS performs well in these 
situations. 
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1 Introduction 



Nonlinear regression effects may be modeled via additive regression models of the form 



where the functions fi, . . . , fp have unspecified functional form. An approach which allows 
flexible representation of the functions /i, • • • , /p is the expansion in basis functions (Hastie 
& Tibshirani 1990). To prevent overfitting, there are two general approaches. In the first 
approach, each function fj is the sum of only a small set of basis functions, 



The basis functions B^j are chosen adaptively by a selection procedure. The second ap- 
proach (that is outlined in Section 3) circumvents the problem of basis function selection. 
Instead, we allow a generous amount Kj ^ 1 of basis functions in the expansion (2). As 
this usually leads to high-dimensional and highly correlated data, we penalize the coeffi- 
cients Pjk in the estimation process (Eilers & Marx 1996). 

Quite generally, a different approach to deal with high dimensionality is to use di- 
mension reduction techniques such as Partial Least Squares (PLS) (Wold 1975, Wold 
et al. 1984). The main idea is to build a few components from the predictor variables 
and to regress y onto these components. A short overview on PLS can be found in Section 
2. 

As a linear approach, PLS probably fails to yield high prediction accuracy in the case of 
nonlinear relationships between predictors and responses as in (1). In order to incorporate 
nonlinear structures, it might be advisable to transform the original predictors preliminarily 
to a PLS regression. This approach has been proposed by Durand & Sabatier (1997) and 
Durand (2001) in different variants. The method proposed by Durand &; Sabatier (1997) 
is based on a variant of PLS that may be computed via an iterative algorithm. They 
suggest an approach that incorporates splines transformations of the predictors within 
each iteration of the iterative algorithm. In contrast, the method proposed by Durand 
(2001) is global. The predictors are first transformed using splines basis functions as a 
preliminary step, then PLS regression is performed on the transformed data matrix. The 
choice of the degree d of the polynomial pieces and of the number of knots is performed 
by an either ascending or descending search procedure that is not automatic. 

For large numbers of variables, this search procedure is computationally intensive and 
might overfit the training data. In the present article, we suggest an alternative approach 
based on the penalty strategy of Eilers & Marx (1996). As described in Section 3, we 
transform the initial data matrix nonlinearly using B-splines basis functions. Our new 
method, which we call penalized PLS, is based on the following principle. The equivalent 
of penalizing the (higher order) differences of adjacent B-splines coefficients is, in the 



Y = i3o + h{Xi) + --- + fp{Xp) + e. 



(1) 




(2) 



fc=i 
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framework of dimension reduction, the penalization of (higher order) differences of adjacent 
weights. 

In Section 4, we introduce an adaptation of the principle of penalization to PLS. More 
precisely, we present a penalized version of the optimization problem attached to PLS. 
Although the motivation stems from its use for B-splines transformed data, the proposed 
approach is very general and can be adapted to other penalty terms or to other dimension 
reduction techniques such as Principal Components Analysis. It turns out that the new 
method shares a lot of properties of PLS and that its computation requires virtually no 
extra costs. We highlighten the close connection between penalized PLS and preconditioned 
linear systems. It is already known that PLS is equivalent to the conjugate gradient method 
(Hestenes & Stiefel 1952) applied to the set of normal equations associated to a linear 
regression problem. We prove that penalized PLS corresponds to a conjugate gradient 
method for a preconditioned set of normal equations, where the preconditioner depends 
on the penalty term. Furthermore, we show that this new technique is closely related to 
the so-called kernel trick. More precisely, we prove that penalized PLS is equivalent to 
ordinary PLS using a generalized inner product that is defined by the penalty term. In 
Sections 5 and 6, we illustrate our method on different data sets. 

In the rest of the paper, we restrict ourselves to a univariate response. In Section 7, 
we stress that the extension of our method to a multivariate response is straightforward. 



2 Partial Least Squares Regression 

Let us consider the general linear regression problem. We want to predict a univariate 
response variable Y using p predictor variables Xi, . . . ,Xp based on a finite set 

of observations. We set 



X 



( 



G 



tnxp 



y 



yi 



\ Vn 



and require for simplicity of notation that both X and y are centered. If we assume 
that the relationship between predictors and response is linear, this relationship can be 
represented in compact form by 

y = Xf3 + e. 



Here, /3 is the p-dimensional vector of regression coefficients and e is the vector of residuals. 



When n < p, the usual regression tools such as ordinary least squares (OLS) regression 
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cannot be applied to estimate /3 since the p x p covariance matrix {l/n)X'^ X (which has 
rank at most n — 1) is singular. From a technical point of view, this may be solved by 
replacing the inverse of the covariance matrix by a generalized inverse. However, for n < p, 
OLS usually fits the training data perfectly and one cannot expect the method to perform 
well on a new data set. Partial Least Squares (PLS) (Wold 1975, Wold et al. 1984) is 
an alternative regression tool which is more appropriate in the case of highly correlated 
predictors and high-dimensional data. PLS is a standard tool for analyzing chemical data 
(Martens & Naes 1989), and in recent years, the success of PLS has lead to applications in 
other scientific fields such as physiology (Rosipal et al. 2003) or bioinformatics (Boulesteix 
& Strimmer 2006). 

The main idea of PLS is to build orthogonal components ti, . . . ,tm from the original 
predictors X and to use them as predictors in a least squares regression. There are different 
PLS techniques to extract these components, and each of them gives rise to a different 
variant of PLS. It is not our aim to explain all variants and we focus on two of them. 
An overview on different forms of PLS can be found in Rosipal & Kramer (2006). A 
component is a linear combination of the original predictors that hopefully reflects the 
relevant structure of the data. PLS is similar to Principal Components Regression (PGR). 
The difference is that PGR extracts components that explain the variance in the predictor 
variables whereas PLS extracts components that have a large covariance with y. We now 
formalize this concept. A latent component t is a linear combination t = Xw of the 
predictor variables. The vector w is usually called the weight vector. We want to find 
a component with maximal covariance to y, that is we want to maximize the empirical 
squared covariance 

coy"^ {Xw,y) = w^X^yy^Xw. 

We have to constrain w in order to obtain identifiability, choosing 

max 'uFx'^yy'^Xw , (3) 
subject to ||iy|| = l. (4) 

Using Lagrangian multipliers, we conclude that the solution wi is - up to a scaling factor 
- equal to X^y. 

Let us remark that (3) and (4) are equivalent to 

w^X^yy^Xw 

max Tf. . (5) 

w 

The solution of (5) is only unique up to a scalar. The normalization of the weight vectors 
w to length 1 is not essential for the PLS algorithm and PLS algorithms differ in the way 
they scale the weight vectors and components. In this paper, we present all algorithms 
without the scaling of the vectors, in order to keep the notation as simple as possible. 
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Subsequent components t2,ts, . . . are chosen such that they maximize (3) and that all 
components tj are mutually orthogonal. In PLS, there are different techniques to extract 
subsequent components, and each technique gives rise to a variant of PLS. We briefly 
introduce two of them. In the method called SIMPLS (de Jong 1993), one computes for 
the ith component, 

max w^X^yy^Xw , 

subject to \\w\\ = 1 and Xw _L tj,j < i . 

Alternatively, one can deflate the original predictor variables X. That is, we only consider 
the part of X that is orthogonal onto all components tj,j < i. For any matrix V, let us 
denote by Vv the orthogonal projection onto the space that is spanned by the columns of 
V. In matrix notation, we have 

= V{V^V)~V^. (6) 

The deflation of X with respect to the components ti, . . . is defined as 

Xi = X- Vt„...,u_,X = - Vu_,Xi_, . (7) 

For the computation of the ith component, X is replaced by Xi in (3). This method 
is called the NIPALS algorithm (Wold 1975). The two methods are equivalent if y is 
univariate in the sense that we end up with the same components ti (de Jong 1993). In 
this paper, we use the NIPALS algorithm. In summary, the PLS algorithm is described in 
algorithm 1. 

Algorithm 1 NIPALS algorithm 

Input: Xi = X, y, number of components m 
for i=l,. . . ,m do 

Wi = Xfy (weight vector) 

ti = XjiOj( component) 

Xi+i = Xi- Pt-Xj (deflation) 
end for 

PLS used to be overlooked by statisticians and was considered an algorithm rather than 
a sound statistical model. This attitude is in parts understandable, as in the early literature 
on the subject, PLS was explained solely in terms of formulas as in algorithm 1. Due to its 
success in applications, the interest in the statistical properties of PLS has risen. It can be 
related to other dimension reduction techniques such as Principal Components Regression 
and Ridge Regression and these methods can be cast under a unifying framework (Stone 
& Brooks 1990). The shrinkage properties of PLS have been studied extensively (Frank & 
Friedman 1993, de Jong 1995, Goutis 1996, Butler & Denham 2000). Furthermore, it can 
be shown that PLS is closely connected to Krylov subspaces and the conjugate gradient 
method (Helland 1988, Phatak & de Hoog 2003). We discuss this method in more detail 
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in Section 4. 

Let us return to the PLS algorithm. With 



T = 




denoting the collection of components, the fitted response is given by 



y = T{T^T)-^T^y = VTy. 



(8) 



In order to obtain the response for new observations, we have to determine the vector of 
regression coefficients y = X/3 . Therefore, a representation of the components tj = XiWi 
as a linear combination of the original predictors X is needed. In other words, we have to 
derive weight vectors Wi with 



They are in general different from the "pseudo" weight vectors Wi that are computed by the 
NIPALS algorithm. In order to avoid redundancy, the derivation of these weight vectors 
is deferred until Section 4. 

It should be noted that the number m of PLS components is an additional model 
parameter that has to be estimated. One way of determining m is by cross-validation. 

3 Penalized Regression Splines 

The fitting of generalized additive models by use of penalized regression splines has become 
a widely used tool in statistics. Starting with the seminal paper by Eilers & Marx (1996), 
the approach has been extended and applied in various publications (Ruppert 2002, Wood 
2000, Wood 2006). The basic concept is to expand the additive component of each variable 
Xj in basis functions as in (2) and to estimate the coefficients by penalization techniques. 
As suggested in Eilers & Marx (1996), B-splines are used as basis functions yielding so- 
called P-splines (for penalized B-splines). Splines are one-dimensional piecewise polynomial 
functions. The points at which the pieces are connected are called knots or breakpoints. 
We say that a spline is of order d if all polynomials are of degree < d and if the spline is 
{d — 1) times continuously differentiable at the breakpoints. A particular efficient set of 
basis functions are B-splines (de Boor 1978). The number of basis functions depends on the 
order of the splines and the number of breakpoints. For a given variable Xj, we consider a 
set of corresponding B-splines basis functions Bij, . . . ,BKj- These basis functions define 
a nonlinear map 



Xwi = XiWi . 



{Bij{x), . . .,BKj{x)) 



T 
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By performing such a transformation on each of the variables Xi,. . . ,Xp, the observation 
vector Xi turns into a vector 

Zi = {Bii{xii),...,Bml{Xil),...,Bip{xip),...,Bmp{Xip))'^ (9) 
= HXi) 

of length pK. Here $ is the function defined by the B-splines. The resulting data matrix 
obtained by the transformation of X has dimensions n x pK and will be denoted by Z in 
the rest of the paper. In the examples in Sections 5 and 6, we consider the most widely 
used cubic B-splines, i.e. we choose d = 3. 

The estimation of (1) is transformed into the estimation of the piC- dimensional vector 
that consists of the coefHcients Pj^: 

/3 = (^11, ■ ■ -jPki, ■ ■ ■ Pi2, ■ ■ ■,Pkp) = ( • • • ) • 
As explained above, the vector /3 determines a nonlinear, additive function 

p p K 

fix) =f3o + Yl M^j) =Po + Y.12 l^kjBkjixj) =Po + . 
j=i j=i k=i 

As Z is usually high-dimensional, the estimation of /3 by minimizing the squared error 

n 

-Y^iVi- f{xi)f = - ||y - ^0 - ZI3f 
n ^-^ n 

i=l 

usually leads to overfitting. Following Eilers & Marx (1996), we use for each variable many 
basis functions, say K !=a 20, and estimate by penalization. The idea is to penalize the 
second derivative of the function /. Eilers & Marx (1996) show that the following difference 
penalty term is a good approximation of the penalty on the second derivative of /, 

p rn 
j=l k=3 

These are also called the second-order differences of adjacent parameters. The difference 
operator A'^Pkj has the form 

^"^Pkj = iPkj - Pk-l,j) - {(3k-l,j - Pk-2,j) 
= Pkj - Wk-\,j + Pk-2,j- 

The coefHcients Xj > control the amount of penalization. This penalty term can be 
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expressed in terms of a penalty matrix P. We denote by Dk the {K — 1) x K matrix 

. . .\ 



Dk = 



( I -I . . 
1 -1 . 



v 



. 1 -1 ^ 

that defines the first order difference operator. Setting 

K2 = {DK-lDK fDK-lDK , 

we conclude that the penalty term equals 

P(/3) = Xj/3l)K2f3^j) = 0'{^x ® ^2)/3 . 

Here is the p x p diagonal matrix containing Ai, . . . , Ap on its diagonal and (8) is the 
Kronecker product. The generalization of this method to higher-order differences of the 
coefficients of adjacent B-splines is straightforward. We simply replace K2 by 

Kq = {DK-q+l . . . DKf{DK-q+l ...Dk). 

To summarize, the penalized least squares criterion has the form 



i?p(/3) = -||y-/3o-Z/3|p + /3^P/3 



n 



with the penalty matrix P defined as 



P = A^(E)Kg. 



This is a symmetric matrix that is positive semidefinite. 



(10) 



fill 



4 Penalized Partial Least Squares Regression 

We now introduce a general framework to combine PLS with penalization terms. We 
remark that this is not limited to spline transformed variables or to the special shape of 
the penalty matrix P that is defined in (11). For this reason, we present the new method 
in terms of the original data matrix X and only demand that P is a symmetric matrix 
such that Jp + P is positive definite. 

Again, we restrict ourselves to univariate responses y. Penalized PLS for multivariate 
responses is briefly discussed in Section 7. We modify the optimization criterion (5) of 
PLS in the following way. The first component ti = Xwi is defined by the solution of the 
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problem 

w'^X'^yy'^Xw 
argmax — — . (12) 

Using Lagrangian multipliers, we obtain the solution 

wi = MX^y (13) 

with M = (Ip + P)~^ . Subsequent weight vectors and components are computed by 
deflating X as described in (7) and then maximizing (12) with X replaced by Xi. In 
particular, we can compute the weight vectors and components of penalized PLS by simply 
replacing Wi = Xjy by (13) in algorithm 1. 

We now present results on penalized PLS that allow us to compute its regression vectors 
efficiently. Note that all results on penalized PLS also hold for ordinary PLS if we choose 
P = 0. Let 

T = {ti,...,tra) , W = {wi,...,Wm), 

denote the matrices of components and weight vectors respectively. 
Lemma 1. The matrix 



is upper bidiagonal, that is 

rij = tJXwj = 

if i < j or i + 1 > j. The matrix R is invertible. Furthermore, the columns of T and the 
columns of XW span the same space. 

This is an extension of a result for ordinary PLS that can be found e.g. in Manne 
(1987). The proof can be found in the appendix. We can now determine the regression 
coefficients for penalized PLS. 

Proposition 2. The Penalized PLS regression vector obtained after m steps is 

pl-pPLs = W {W^X'^XWy^W^X'^y. (14) 

In particular, the penalized PLS estimator is the solution of the constrained minimization 
problem 

rmn \\y - Xfif 
subject to ^ E span{wi, . . . ,Wm} ■ (15) 

Proof. We deduce from lemma 1 that the columns of Xw span the same space as the 
columns of T. As PLS is ordinary least squares regression with predictors ti, . . . ,tm, we 
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have 



y = VtV = Vxwy = XW [W^X'^XW] ^ W^X^y . 



The second statement can be proven by noting that the OLS minimization problem with 
constraints (15) is equivalent to an unconstrained minimization problem for /3 = Wa with 
a G W^. If we plug this into the formula for the OLS estimator, we obtain (14). □ 

Formula (14) is beneficial for theoretical purposes but it is computationally inefficient. 
We now show how the calculation can be done in a recursive and faster way. The key point 
is to find "effective" weight vectors Wi such that for every i 

ti = XiWi = Xwi . (16) 

This can be done by exploiting the fact that R is bidiagonal. 

Proposition 3. The effective weight vectors Wi defined in (16) and the regression vectors 
of penalized PLS are determined by setting Wq = and ^ = and computing iteratively 

wf_^X^Xwi ^ 

Wi = Wi - ^j, ^rj,^^ Wi-l , 



The proof can be found in the appendix. Combining this result with the PLS algorithm 
1, we obtain the penalized PLS algorithm 2. 

Algorithm 2 Penalized PLS 

Xi = X, number of components m, penalty matrix P (input) 

M = {Ip + Py^, Wq = 0, 3^°^ = (initialization) 
for i=l,. . . ,m do 

Wi = MXjy (weight vector) 

Wi = Wi — — x^^Xw (efiective weight vector) 

p = p + J, ^ Wi (regression vector) 

ti = XiWi (component) 
Xj+i = Xi- VtiXi (deflation) 
end for 



4.1 Partial Least Squares and Krylov Subspaces 

It is well-known that PLS is closely connected to Krylov subspaces and conjugate gradient 
methods. Quite generally, linear regression problems can be transformed into algebraic 
problems in the following way. The OLS estimator is the solution of the minimization 
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problem 



rmn ||y - X/3f . (17) 

This is equivalent to finding the solution of the associated normal equation 

Af3 = b (18) 

with b = X^y and A = X^X . If the matrix A is invertible, the solution of the normal 
equations is the OLS estimator f3 = A^b. If A is singular, the solution of (18) with 
minimal Euclidean norm is A~b. We already mentioned in Section 2 that in the case of 
high dimensional data, the matrix A is often (almost) singular and that the OLS estimator 
performs poorly on new data sets. A popular strategy is to regularize the least squares 
criterion (17) in the hope of improving the performance of the estimator. This corresponds 
to finding approximate solutions of (18). For example, Ridge Regression corresponds to 
the solution of the modified normal equations 

{A + XIp)P = b. 

Here A > is the Ridge parameter. Principal Components Regression uses the eigen 
decomposition of A 

p 

A = UAU^ = J2^i'^iuJ 

1=1 

and approximates A and b via the first m eigenvectors 

It can be shown that the PLS estimators are equal to the approximate solutions of the 
conjugate gradient method (Hestenes & Stiefel 1952). This is a procedure that iteratively 
computes approximate solutions of (18) by minimizing the quadratic function 

m = ^I3'^A(3 -0^h = \ (/3, AI3) - (/3, b) (19) 

along directions that are A-orthogonal. The approximate solution obtained after m steps 
is equal to the PLS estimator obtained after m iterations. 

The conjugate gradient algorithm is in turn closely related to Krylov subspaces and the 
Lanczos algorithm (Lanczos 1950). The latter is a method for approximating eigenvalues. 
The connection between PLS and these methods is well-elaborated in Phatak & de Hoog 
(2003). We now establish a similar connection between penalized PLS and the above 
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mentioned methods. Set 

Am = MA and bM = Mb. 

Recall that M is a symmetric and positive definite matrix that is determined by the 
penalty term P. We now illustrate that penalized PLS finds approximate solutions of the 
preconditioned normal equation 

AmP = bM- (20) 

Lemma 4. The space spanned by the weight vectors Wi, . . . ,Wm of penalized PLS is the 
same as the space spanned by the Krylov sequence 

bM,AMbM,---,A^^bM- (21) 

This is the generaUzation of a result for ordinary PLS and can be proven via induction. 
Details are given in the appendix. We denote by 

JCim) ^ ]C(-^) (AM,bM) 

the space that is spanned by the Krylov sequence (21). This space is called a Krylov space. 
Corollary 5. The penalized PLS estimator is the solution of the optimization problem 

min \\y-Xf3f 
subject to Pe'l&^\ 

Proof. This follows immediately from proposition 2 and the fact that the weight vectors 
span the Krylov space Al("^) . □ 

We now present the conjugate gradient method for the equation 

Am/3 = bM- (22) 

The Conjugate gradient method is normally applied if the involved matrix is symmetric. 
Note that in general, the matrix Am is not symmetric with respect to the canonical inner 
product, but with respect to the inner product 

{x,x)m-^ = M~^x 
defined by M~^. We can rewrite the quadratic function ^ defined in (19) as 

0(/3) = ^ {/3, Aa^/3)^_i - {fi,bM)M-^ ■ 
We replace the canonical inner product by the inner product defined by and minimize 
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this function iteratively along directions that are AA4--orthogonal. 

We start with an initial guess (3^ = and define do = Vq = hM — -^mPo = ^m- The 
quantity dm is the search direction and is the residual. For a given direction dm, we 
have to determine the optimal step size, that is we have to find 

am = arg min (j) {(3^ + adm) ■ 

a 

It is straightforward to check that 

{dmi'l'm) 

{dm-, -^Mdm) M-'^ 

The new approximate solution is then 
After updating the residuals via 

rm+l = bM - AMf^m+li 

we define a new search direction dm+i that is AM-orthogonal to the previous search direc- 
tions. This is ensured by projecting the residual onto the space that is Ajvf-orthogonal 
to do, • • • , dm- We obtain 



^ {di,AMdi)M-l 



Theorem 6. The penalized PLS algorithm is equal to the conjugate gradient algorithm for 
the preconditioned system (20). 

The presentation of the conjugate gradient method above and the proof of its equiv- 
alence to penalized PLS are an extension of the corresponding results for PLS that is 
given in Phatak & de Hoog (2003). The proof can be found in the appendix. Note that 
there is a different notion of conjugate gradients for preconditioned systems (Golub & van 
Loan 1983). We transform the preconditioned equation (19) by postmultiplying with M: 

MAMP = Mb with P = M-^f3. 

As the matrix MAM is symmetric, we can apply the ordinary conjugate gradient algo- 
rithm to this equation. This approach differs from the one described above. 

Proposition 7. Suppose that A = X'^X is regular. After at most p iterations, the 
penalized PLS estimator equals the OLS estimator. 
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Proof. Using (15), the above statement is equivalent to showing that 

0OLS e 

Hence, we have to show that there is a polynomial tt of degree < p — 1 such that Pols — 
TT (-^A^) ■ As M is invertible, the OLS estimator is 

3oL5 = A-^b = A-^M-^ ■ Mb = (MA)-^ Mb = A^bM ■ 

As Am is the product of two symmetric matrices and M is positive definite. Am has a 
real eigendecomposition. 

Am = UTU-^ 

We define the polynomial tt via the at most p equations 

7i"(7i) = — • 

It follows immediately that tt^Am) = ^m- This concludes the proof. □ 
4.2 Kernel Penalized Partial Least Squares 

The computation of the penalized PLS estimator as presented in algorithm 2 involves 
matrices and vectors of dimension p x p and p respectively. If the number of predictors p 
is very large, this leads to high computational costs. In this subsection, we show that we 
can represent this algorithm in terms of matrices and vectors of dimension n X n and n 
respectively. Let us define the n x n matrix Km via 

Km = {{xi,Xj)M) = XMX^ . 

This matrix is called the Gram matrix or the kernel matrix of X. We conclude from 
corollary 5 that the penalized PLS estimator obtained after m steps is an element of the 
Krylov space 1C^'^\Am-i ^m)- It follows that we can represent the penalized PLS estimator 
as 

3^""^ = MX^a("*) , G K.^"^) {Km^v) ■ 

Here, the Krylov space KS'^'> {Km-, u) is the space spanned by the vectors 

y,KMy,...,KZ-^y. 

Analogously, we can represent the effective weight vectors by 

= MX^5^ , Sim e lC(^^KM,y) . 



14 



It follows from the definition of the deflation step that 

Xly = (I„ - Pt,...,t._,) y = x\y- . 
We conclude that the weight vector Wi is simply 

If we plug in these representations into the penalized PLS algorithm 2, we obtain algorithm 
3 that depends only on Km and y. 

Algorithm 3 Kernel penalized PLS 

X, y, number of components m, penalty term P (input) 

M = {Ip + Py\ Km = XMX^, a(0) = a^") = (initialization) 

for i=l,. . . ,m do 

yiTs = y - y^*""^) (residuals) 

am = yrfs - '^.p-^^MVr^s (effective weight vector) 

~ T 

q;(H = q;(™'~i) + ^'^y dim (regression vector) 

ti = KMCt-m (component) 

g{m+i) ^ -(m) _^ j)^ y (estimation of y) 

end for 



A kernel version of PLS has already been defined in Rannar et al. (1994) in order 
to speed up the computation of PLS. We repeat that tlie speed of the kernel version of 
penalized PLS does not depend on the number of predictor variables at all but on the 
number of observations. This implies that - from an algorithmic point of view - there are 
no restrictions in terms of the number of predictor variables. The importance of this so- 
called "dual" representation also becomes apparent if we want to extend PLS to nonlinear 
problems by using the kernel trick. In this paper, the kernel trick appears in two different 
versions. 

Let us only consider the case of ordinary PLS on B-Splines transformed variables. 
Recall that in (9), we transform the original data X using a nonlinear function $ defined 
by the B-Splines. As algorithm 3 only relies on inner products between observations, 
the nonlinear transformation does not increase the computational costs. We only have to 
compute the kernel matrix of inner products 

K = mx,)Mx,))\,=l,...,n- 

This implies that we do not have to map the data points explicitly using a function It 
suffices to compute the function 

k{x,x) = ($(a;),$(x)) . (23) 
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The function k is called a kernel. The replacement of the usual inner product by kernel is 
known as the kernel trick and has turned up to be very popular in the machine learning 
community. Instead of defining a nonlinear map we define a "valid" kernel function 
k{x,z). E.g., polynomial relationships can be modeled via kernels of the form 

kdix,z) = il + {x,z))\deN. 

Furthermore, it is possible to define kernels for complex data structures as graphs or text. 
Literature on the kernel trick and its applications is abundant. A detailed treatise of the 
subject can be found in Scholkopf & Smola (2002). A nonlinear version of PLS using the 
kernel trick is presented in Rosipal & Trejo (2001). 

If we represent penalized PLS in terms of the kernel matrix Km, we realize that 
penalized PLS is closely connected to the kernel trick in other respects. Using algorithm 3 
or the definition of the kernel matrix Km, we realize that penalized PLS equals ordinary 
PLS with the canonical inner product replaced by the inner product 

{x, z)m = x^Mz . 

This function is called a linear kernel. Why is this a sensible inner product? Let us consider 
the eigendecomposition of the penalty matrix, P = S@S^ . We prefer direction s such 
that s'^Ps is small, that is we prefer directions that are defined by eigenvectors Sj of P 
with a small corresponding eigenvalue 9i. If we represent the vectors x and z in terms of 
the eigenvectors of P, 

x = S^x , z = S'^z, 

we conclude that 

P I _ _ 

{x, z)m = x^ {Ip + @) ^ 2 = X] 1 _L a ^^^^ 

i=i ^ + 

This implies that directions Sj with a small eigenvalue 0i receive a higher weighting than 
directions with a large eigenvalue. 

5 Example: Birth Data 

In this section, we analyze a real data set describing pregnancy and delivery for 42 infants 
who are sent to a neonatal intensive care unit after birth. The data are taken from the 
R (R Development Core Team 2005) software package exactmaxsel and are introduced 
in Boulesteix (2006). Our goal is to predict the number of days spent in the neonatal 
intensive care unit (y) based on the following predictors: birth weight (in g), birth height 
(in cm) , head circumference (in cm) , term (in week) , age of the mother (in year) , weight of 
the mother before pregnancy (in kg), weight of the mother before delivery (in kg), height 
of the mother (in cm), time (in month). Some of the predictors are expected to be strongly 
associated with the response (e.g., birth weight, term), in contrast to poor predictors like 
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time or height of the mother. 

The parameter settings are as follows. We make the simplifying assumption that A = 
Ai = . . . = Ap, which reduces the problem of selecting the optimal smoothing parameter to a 
one- dimensional problem. As abeady mentioned above, we use cubic splines. Furthermore, 
the order of difference of adjacent weights is set to 2. The shape of the fitted functions 




1000 1500 2000 2500 3000 3500 4000 1000 1500 2000 2500 3000 3500 



Figure 1: Fitted function for the predictor variable "weight" using penalized PLS. The 
value of A is 2000 and the numbers of components are 1,5 (top) and 9, 13 (bottom). 

fj depends on the two model parameters A and m. We first illustrate that the number 
m of penalized PLS components controls the smoothness of the estimated functions. For 
this purpose, we only consider the predictor variable "weight". Figure 1 displays the fitted 
functions obtained by penalized PLS for A = 2000 and 4 different numbers of components 
m = 1,5, 9, 13. For small values of m, the obtained functions are smooth. For higher values 
of m, the functions adapt themselves more and more to the data which leads to overfitting 
for high values of m. 

We compare our novel method to PLS without penalization as described in (Durand 
2001) and the gam() package in R. This is the implementation of an adaptive selection 
procedure for the basis functions in (2). More details can be found in Wood (2000) and 
Wood (2006). This is the standard tool for estimating generalized additive models. The 
optimal parameter values of (penalized) PLS are determined by computing the leave-one- 
out squared error. We remark that the split into training and test set is done before 
transforming the original predictors using B-splines. In order to have comparable results, 
we normalize the response such that var(y) = 1. The results are summarized in Table 1. 
Penalized PLS is the best out of the three method. In particular, it receives a considerably 
lower error than PLS without penalization. 
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leave-one-out-error 


rriopt 




PLS 


0.159 


8 




penalized PLS 


0.090 


2 


330 


GAM 


0.115 







Table 1: Optimal model parameters and leave-one-out error for the birth data set and 
normalized response. 

6 Example: Polymer Data 

This data set consists oi p = 10 predictor variables and four response variables. The 
number of observations is n = 61. The data are taken from a polymer test plant. It 
can be downloaded from ftp://ftp.cis.upenn.edu/pub/ungar/chemdata/. The predic- 
tor variables are measurements of controlled variables in a polymer processing plant (e.g. 
temperatures, feed rates ...). No more details on the variables are given due to confi- 
dentiality reasons. As in the last section, we first scale each response variable to have a 
variance equal to 1. Again, we compare penalized PLS to PLS and gamO. The results are 
summarized in Table 2. For all four response variables, penalized PLS is better than PLS 





response 


2"'-' response 


'■'/'^ response 


1*'' response 


PLS 


0.672 


0.863 


0.254 


0.204 


penalized PLS 


0.607 


0.801 


0.206 


0.164 


GAM 


0.599 


0.881 


0.218 


0.182 



Table 2: Leave-one-out error for the polymer data set and normalized response. 



without penalization. Penalized PLS is also better that GAM for three out of the four 
response variables, although the difference is considerably smaller. 

7 Concluding Remarks 

In this work, we proposed an extension of Partial Least Squares Regression using penal- 
ization techniques. Apart from its computational efficiency (it is virtually as fast as PLS), 
it also shares a lot of mathematical properties of PLS. Our novel method obtains good 
results in applications. In the two examples that are discussed, penalized PLS clearly 
outperforms PLS without penalization. Furthermore, the results indicate that it is a com- 
petitor of gamO in the case of very high-dimensional data. 

We might think of other penalty terms. Kondylis & Whittaker (2006) consider a 
preconditioned version of PLS by giving weights to the predictor variables. Higher weights 
are given to those predictor variables that are highly correlated to the response. These 
weights can be expressed in terms of a penalty term. Goutis & Fearn (1996) combine 
PLS with an additive penalty term to data derived from near infra red spectroscopy. The 
penalty term controls the smoothness of the regression vector. 
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The introduction of a penalty term can easily be adapted to other dimension reduction 
techniques. For example for Principal Components Analysis, the penalized optimization 
criterion is 

vai(Xw) 

max ■ 



w w^w + w^Pw 

PLS can handle multivariate responses Y. The natural extension of criterion (3) is the 
following. 

||cov(Xiti,l")f w^X^YY^Xw 
max = max . 

Using Lagrangian multipliers, we deduce that the solution is the eigenvector of the matrix 

B = X'^YY^X 

that corresponds to the largest eigenvalue of B. This eigenvector is usually computed in 
an iterative fashion. If we want to apply penalized PLS for multivariate responses, we 
compute 

w'^X'^YY'^Xw 
max rj-i r% ' 

w w^w + w^ Pw 

The solution fulfills 

Bw = 7 (Ip + P) w, 7 G M . 

This is called a generalized eigenvalue problem or a matrix pencil. Note that for multivari- 
ate y, the equivalence of SIMPLS and NIPALS does not hold, so we expect the penalized 
versions of these methods to be different as well. There are kernel versions for PLS with 
multivariate Y (Rannar et al. 1994, Rosipal & Trejo 2001), hence we can also represent 
multivariate penalized PLS in terms of kernel matrices. 
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A Proofs 



We recall that for fc < i 



i-l 



X^ = ll {In - Vt,) Xk = {In " ^t„...,t,_0 X, 
j=k 



(24) 



The last equaUty follows from the fact that the components tj are mutually orthogonal. In 
particular, we obtain 



{ln-Vt,,...,U.,)X. 



(25) 



Proof of lemma 1. First note that (25) is equivalent to X = Xj +Vti,...,tj-iX . It follows 
that 



As all components tj are mutually orthogonal, 



(26) 



i=l 



tJXWj 



tJtij^O ,i = j 

,i>j 

* , otherwise 



We conclude that R is an upper triangular matrix with all diagonal elements ^ 0. Further- 
more, it follows from (26) that all vectors Xwj are linear combinations of the components 
ti, . . . ,tj. This implies that the columns of XW and the columns of T span the same space. 
Finally, we have to show that R is bidiagonal. To prove this, we show that XiWj = 
for j < i. The condition i > j implies (recall (24)) that Xi = Xj — Vtj,...,ti-iXj and 
consequently 



XiWj = XjWj - Vti,...,ti-iXjWj = tj - Vti,...,ti-i^j 



This implies that for z — 1 > _7 



j<i-i 



tj-tj = 0. 



tJXwj 



= tJ [Xi + Vt,,...,u_,X)wj 
= tJ {XiWj + Vt„...,u_,Xwj) 
= tf{Pt„...,u.,Xwj)=0. 



□ 



Proof of proposition 3. For i = 1, we have Wi = Wi as X\ = X. For a general i, we have 
tj+i = Xi+iWi+i = {X - Vti,...,tiX) Wi+i = Xwi+i - VtiXwi+i . 
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The last equality holds as R = T^XW is bidiagonal. Using formula (6) for the projection 
operator, it follows that 



wf A ^ Xwi 



We conclude that 



wl_X^Xw,+, 



X^ Xwi 



I 



The regression estimate after i + 1 steps is 

= X^^'^+Vu+.Y 

= X(3^' +X Zt^ - 

wf^^X^ Xwi+i 

This concludes the proof. □ 

Proof of lemma 4- We use induction. For m = 1 we know that Wi = bM- For a fixed 
m > 1, we conclude from the induction hypothesis and lemma 1 that every vector s that 
lies in the span of ti, . . . , is of tlie form 

s = Xv , V G span{«;i, . . . , Wm} = KS'^'> . (27) 

We conclude that 

Xl^^y = {X- Vt„...,tmXfy = X'^y - X^Vt„...,t^y b - X^Xs 

and that 

Wm+i = MX'i^^y = Mb- MAs = bM - Ams E KP^^^^ . 

□ 

In the rest of the appendix, we show the equivalence of penalized PLS and the precon- 
ditioned conjugate gradient method. 

Lemma 8. We have 

span{do,...,dm-i} = span {ro, . . . , r^-i} = span{xi, . . . ,Xm} = . 

This can be proven via induction. 
Lemma 9. We have 
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Proof. This corresponds to the iterative definition of /3^+i- We only have to show that 

{di,ri)]^-i = {di,bM)M-'^ ■ 

Note that 

ri = b - y^^ajAMdj . 

j=0 

As di is Ajv^-orthogonal onto all directions dj , j < i, the proof is complete. □ 

Now we are able to proof the equivalence of penalized PLS and the conjugate gradient 
method. 

Proof of theorem 6. As the search directions dj span the Krylov space K^'^^ , we can replace 
the matrix W in (14) by the matrix D = (do, . . . ,dm-i)- As the search directions are 
Ajvf-orthogonal, we have 

Pppj^s = D{D^AD) 'D^b 

= D [D'^ M-^ AmDY^ D'^ M-^bM 

^ {di, AMdi) f^-i 

and this equals the formula in lemma 9. □ 
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